-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for propolis-based softnpu device, fix multi-switch uplink updates. #4390
Conversation
c537e30
to
85899eb
Compare
illumos-utils/src/svc.rs
Outdated
} else { | ||
if let Some(zname) = zone { | ||
if let Err(out) = | ||
tokio::process::Command::new(PFEXEC) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put this here because services do not come up reliably when Omicron is running in a VM. I'm not sure if this is something we want in the mainline code to deal with early zone service failure or not. AFAICT this is only being called when waiting on zones to come to milestone/single-user
. @smklein @andrewjstone @davepacheco.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut is this is definitely not something we want to be doing under normal operation. If things are going into maintenance, I'd expect we need to figure out why and fix the underlying issues. Without knowing what those are, it's hard to know we're not papering over a more significant issue or one that could come up again later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've put this behind a compilation flag. In order for it to get compiled into the code
RUSTFLAGS="--cfg svcadm_autoclear"
is required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we use that flag anywhere or is that just for development environments where people happen to be seeing this behavior? Are you seeing this problem (services going into maintenance during startup) frequently?
I'm a little worried because to my knowledge we have not seen this before. Might this change cause us to see this problem elsewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The flag is a one-off that one has to explicitly set for this code to be present. The goal is to make it hard to accidentally enable this.
I'm working in a multi-switch falcon topology that hosts the control plane. This topology is composed of propolis VMs interconnected by simnet links. In this environment I get many services going into maintenance mode every time I launch the control plane. The number and frequency of services going into maintenance on startup is high enough to make it very time-consuming to find and clear them all by hand. With this little workaround, things mostly come up automatically as the issues all appear to be transient in nature.
I'm a little worried because to my knowledge we have not seen this before. Might this change cause us to see this problem elsewhere?
We started seeing this when I started running the control plane in topologies composed of VMs. Why this happens in VMs, I'm not sure - but I don't think anything here would cause this issue to manifest elsewhere. I think it's a bad interaction at the intersection of the control plane, SMF and running on virtual hardware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for that context! That makes sense. I think it's almost always a bug for any SMF service to go into maintenance on our systems. If you haven't already, would you mind filing an issue for that (even if it's not high priority)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I filed #4399 on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
// TODO: unfortunately this test can no longer be run in the integration test | ||
// suite because it depends on communicating with MGS which is not part | ||
// of the infrastructure available in the integration test context. | ||
#[ignore] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we unable to use the faux MGS here (the same one we use to simulate MGS for the switch zone with SoftNPU)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm sure that can be made to work, but it's not a part of the current integration test infrastructure and not something I have the bandwidth to take on at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally understandable. It may be something that fits into the scope of work for the next set of RPW PRs, I was curious if there were any blockers I wasn't aware of.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For what it's worth, as of (I think it was) #4291, the test suite starts a copy of MGS backed by simulated SPs. (I haven't looked at what this test requires so I'm not sure that will work but it may be an option.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. I'll give this a shot when I merge main back into this branch.
Depends on: